Human Mutation — Latest Matching Preprints

1

A structure-aware framework for genomic variant interpretation in genetic skeletal disorders

Piticchio, S. G.; Hosseini, N.; Grigelioniene, G.; Orellana, L.

2026-03-17 genomics 10.64898/2026.03.15.711892 medRxiv

Top 0.1%

24.1%

Background: Genetic skeletal disorders (GSDs) comprise a heterogeneous group of rare, predominantly monogenic conditions that are increasingly diagnosed through high-throughput sequencing. While gene discovery has progressed rapidly, interpretation of pathogenic and uncertain variants remains a major bottleneck, in part because their functional consequences are determined at the protein structure level. However, a systematic assessment of structural knowledge across GSD-associated genes is currently lacking. Here, we present a comprehensive protein structure-centric analysis of 674 protein-coding genes implicated in GSDs. Methods: We integrated experimental structures, AlphaFold2 (AF2) models, multimeric states, protein-protein interactions, and ClinVar variant annotations. Results: We quantify experimental structural availability and sequence coverage, revealing that 37% of GSD proteins lack any experimental structure and that, among proteins with structures, sequence coverage is often incomplete. We show that AF2 models provide high-confidence structural information for a substantial subset of proteins lacking experimental data, but that model reliability strongly correlates with existing structural coverage. Analysis of multimeric assemblies and co-occurring partners demonstrates that many GSD proteins function as obligate multimers, highlighting the importance of interface-level interpretation of variants. Finally, mapping clinically annotated missense variants onto representative protein structures illustrates how structural context can inform the interpretation of pathogenic and uncertain variants, particularly at interaction interfaces. Conclusions: Together, this work provides a structure-aware reference framework for GSD genes, highlighting systematic gaps in current protein knowledge and demonstrating how integration of structural data can guide genomic variant interpretation.
Our observations support a broader principle of structural equivalence, whereby distinct variants converge on shared structural perturbations that explain clustering patterns and enable mechanistic interpretation of nearby variants of uncertain significance.

2

Differential causative effects of germline pathogenic variants in MUTYH and PALB2 in a patient with colorectal polyposis and breast cancer

Camacho Valenzuela, J.; Pelletier, D.; Polak, P.; Fu, L.; Hamel, N.; Domecq, C.; Ahmed, A.; Robles-Espinoza, C. D.; Foulkes, W. D.

2026-05-25 genetic and genomic medicine 10.64898/2026.05.15.26352890 medRxiv

Top 0.1%

17.3%

Purpose: Patients carrying Germline Pathogenic Variants (GPVs) in multiple cancer susceptibility genes (CSGs) can be described within the context of Multi-locus Inherited Neoplasia Allele Syndrome (MINAS). The role of each GPV is typically interpreted based on clinical phenotypes. Here, we used tumor sequencing, particularly mutational signatures, to investigate the contribution of GPVs in MUTYH and PALB2 to colorectal polyposis and breast cancer in a single patient at a molecular level. Methods: We analyzed tumor sequencing data, including mutational signatures and genomic scars, of a breast tumor and a colorectal polyp from a patient with biallelic GPVs in MUTYH and a heterozygous GPV in PALB2. Results: The colorectal polyp showed a dominant contribution of MUTYH-associated Base Excision Repair deficiency (BERd) mutational signatures, with no evidence of Homologous Recombination Repair Deficiency (HRD). In contrast, the breast tumor showed both MUTYH-driven BERd and HRD-associated signatures, including SBS3, ID6 and an elevated HRD score, despite the absence of a detectable second hit in PALB2. These findings suggest a differential contribution from the CSGs, with MUTYH contributing to both lesions and PALB2 contributing specifically to the breast tumor. The observed pattern does not align with the additive or synergistic models described in MINAS. Conclusions: Our study provides evidence that mutational signatures can elucidate the contribution of multiple CSGs to tumorigenesis within a single patient. These findings extend current interpretations of MINAS beyond additive or synergistic phenotypes, which may help to better understand tumor etiology, with potential clinical implications, including eligibility for targeted therapies.

3

Cancer Variant Interpretation Group UK (CanVIG-UK): updates on an exemplar national subspecialty multidisciplinary network

Garrett, A.; Allen, S.; Rowlands, C. F.; Choi, S.; Durkie, M.; Burghel, G. J.; Robinson, R.; Callaway, A.; Field, J.; Frugtniet, B.; Palmer-Smith, S.; Grant, J.; Pagan, J.; McDevitt, T.; Hughes, L.; Johnston, E.; Yarram-Smith, L.; Logan, P.; Reed, L.; Snape, K.; Hanson, H.; McVeigh, T. P.; Turnbull, C.; CanVIG,

2026-03-19 genetic and genomic medicine 10.64898/2026.03.17.26348157 medRxiv

Top 0.1%

15.0%

Cancer Variant Interpretation Group UK was established in 2017 in response to the publication of the 2015 ACMG/AMP v3 guidance for the interpretation of sequence variants. Its initial purpose was to ensure consistency in the UK clinical-laboratory community implementation of ACMG/AMP v3 guidance for cancer susceptibility genes (CSGs). Still convening for monthly national meetings, the remit of CanVIG-UK now encompasses additional activities delivered under the following objectives: 1) Creation of a national multidisciplinary professional network and regular forum. 2) Delivery of training and education. 3) Establishment of a consensus approach to the fundamentals of variant interpretation in cancer susceptibility genes. 4) Development and ratification of gene-specific frameworks for variant interpretation for cancer susceptibility genes. 5) Development and maintenance of an online platform to facilitate information sharing and variant interpretation within the UK clinical-laboratory community. 6) Facilitation of UK contribution to international variant interpretation endeavours. A survey of CanVIG-UK members evaluating the impact of these activities, conducted in November 2025, had 163 responses, including 113 clinical scientists/trainees and 27 Clinical Genetics consultants/trainees. The utility of the CanVIG-UK consensus recommendations for variant interpretation in cancer susceptibility genes was highly rated, with 89/145=61.4% of survey respondents reporting using the guidance at least weekly (≥4 times/month) and 124/128=96.9% rating it as extremely/very useful. The usage frequency and utility of the gene-specific guidance reported by survey respondents were similar to those reported for the main consensus specification. Both qualitative and quantitative survey responses clearly demonstrate the value of the CanVIG-UK activities to the clinical-diagnostic community.
Key messages: 1) What is already known on this topic: Cancer Variant Interpretation Group UK (CanVIG-UK) is a national subspeciality multidisciplinary network first established in 2017. It brings together members of the UK clinical-laboratory community to improve accuracy and consistency in the interpretation of variants in cancer susceptibility genes (CSGs). 2) What this study adds: this article presents the results of a survey of CanVIG-UK members, demonstrating the impact of CanVIG-UK activities on their services, as well as a review of progress in the six updated objectives of CanVIG-UK. 3) How this study might affect research, practice or policy: this article presents current priorities and practices and potential future directions for variant interpretation in CSGs across the UK and Republic of Ireland.

4

The D4Z4caster DNA methylation signature identifies individuals at epigenetic risk for developing facioscapulohumeral muscular dystrophy (FSHD)

Jones, T. I.; Eriksen, B. Z.; Farooqi, M. N.; Gould, T.; Jones, P. L.; King, O. D.

2026-05-29 genetics 10.64898/2026.05.26.727947 medRxiv

Top 0.1%

12.8%

Background: Facioscapulohumeral muscular dystrophy (FSHD) is caused by epigenetic dysregulation at the chromosome 4q35 D4Z4 repeat array under specific permissive genetic conditions. Due to the complexity, expense, and general inaccessibility of FSHD genetic testing, many individuals displaying characteristic muscle weakness are never genetically confirmed and at-risk relatives cannot get screened. We previously developed a targeted bisulfite sequencing (BSS) protocol using the Sanger method to determine DNA methylation levels at specific D4Z4 loci relevant to distinguishing forms of FSHD from non-FSHD that can be used with DNA isolated from saliva, thereby reducing cost and increasing accessibility compared to traditional D4Z4 deletion testing that uses DNA isolated from blood. Methods: Here, we adapt the D4Z4 BSS protocol to next-generation sequencing (NGS) to increase sequencing depth and further reduce cost, validate both sequencing technologies against several cohorts of genetically defined samples, and introduce the D4Z4caster software for computing DNA methylation signatures with diagnostic utility from raw sequencing data. Results: Both Sanger and NGS BSS methods using D4Z4caster were validated as providing high sensitivity and specificity, with geometric mean of sensitivity and specificity (G-mean) >95% and area under the ROC curve (AUC) of 0.99. The NGS method allows for higher throughput and increased read depth, while the Sanger method allows faster processing of individual samples. Importantly, the NGS method could identify FSHD1 cases that are likely mosaic and would otherwise be missed. Conclusions: D4Z4caster methylation signatures can accurately detect contracted FSHD1-permissive chromosome 4q35 alleles, hypomethylation of D4Z4 arrays indicative of FSHD2, and SNPs that are important for diagnostic use. This workflow is amenable to transitioning to clinical settings for an accurate, low-cost FSHD molecular diagnostic test that could be accessible worldwide.
What is already known on this topic: Currently accepted genetic diagnostics for FSHD1 are complex and expensive and can mischaracterize certain complex genetic cases. These diagnostics all require high molecular weight genomic DNA typically freshly isolated from blood, highly specialized equipment, and additional testing for FSHD2, making FSHD diagnostics the most expensive among neuromuscular diseases and inaccessible to much of the world. However, the epigenetic status of the 4q35 and 10q26 D4Z4 repeat arrays, as determined by DNA methylation status using our bisulfite sequencing-based protocol, distinguishes genetically FSHD1, FSHD2, and non-FSHD samples. Additionally, since our protocol is PCR-based, it can utilize DNA isolated from multiple sources, including saliva and buccal swabs. What this study adds: This study validates the relevant DNA methylation signatures against several large cohorts of genetically-confirmed FSHD and non-FSHD samples and optimizes the DNA methylation data analysis for the greater accuracy required for diagnostic utility, including the exclusion of nonpathogenic chromosome 10q or 4A166 contractions. In addition, we introduce the D4Z4caster analysis software, which runs in a portable and scalable Docker container, and provides increased quantitative accuracy important for: 1) confirming likely clinical cases of FSHD that do not meet the currently accepted genetic definition of FSHD1 or FSHD2, 2) identifying FSHD1 somatic mosaicism, and 3) potential prognostic applications. How this study might affect research, practice or policy: FSHD1 is genetically defined by a D4Z4 array at the 4q35 locus that is contracted to 1-10 repeat units. However, disease penetrance is influenced by repeat number, epigenetic modifications, and genetic background, causing a misalignment of current genetic diagnosis with clinical diagnosis.
This study will improve the accuracy of epigenetic analysis for determining cases of genetic FSHD, help broaden the definition of genetic FSHD to more accurately correspond to clinical FSHD, and allow identification of those at risk for developing clinical FSHD in affected families and in large population studies now being performed and proposed. In addition, it will better inform how an individual's epigenetic status is interpreted for potential prognostic value. Overall, this methodology is: 1) significantly less expensive than current clinically-approved FSHD diagnostic technologies, 2) more accessible due to compatibility with DNA isolated from multiple sources including saliva, and 3) compatible with the current sequencing equipment and workflow for DNA isolation used in commercial clinical laboratories. Together, these advantages will help move the technology toward becoming an approved molecular diagnostic test for FSHD in the USA, Europe, and countries currently lacking clear access to testing.
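
For readers unfamiliar with the G-mean metric quoted in the Results above, the following is a minimal sketch of how the geometric mean of sensitivity and specificity is computed from a confusion matrix. The function name and the counts are hypothetical and chosen for illustration only; they are not taken from the preprint or from the D4Z4caster software.

```python
import math


def g_mean(tp: int, fn: int, tn: int, fp: int) -> float:
    """Geometric mean of sensitivity and specificity.

    tp/fn count affected samples called positive/negative;
    tn/fp count unaffected samples called negative/positive.
    """
    sensitivity = tp / (tp + fn)  # true-positive rate
    specificity = tn / (tn + fp)  # true-negative rate
    return math.sqrt(sensitivity * specificity)


# Hypothetical counts, for illustration only (not data from the study).
score = g_mean(tp=48, fn=2, tn=97, fp=3)
print(f"G-mean = {score:.3f}")
```

Unlike plain accuracy, the G-mean drops sharply if either sensitivity or specificity is poor, which makes it a reasonable summary when affected and unaffected cohorts differ in size.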

5

PAVS: A Standardized Database of Phenotype-Associated Variants from Saudi Arabian Rare Disease Patients

Abdelhakim, M.; Althagafi, A.; Schofield, P.; Hoehndorf, R.

2026-04-06 genetic and genomic medicine 10.64898/2026.04.05.26350189 medRxiv

Top 0.1%

10.3%

Genotype-phenotype databases are essential for variant interpretation and disease gene discovery. Genetic variation differs among human populations, mainly in allele frequencies and haplotype patterns shaped by ancestry and demographic history. Population-specific genotypes can influence traits and disease risk; this makes population-specific characterization important. Most existing resources focus on the characterization of a population's genetic background, but do not represent the resulting phenotypes. We have developed PAVS (Phenotype-Associated Variants in Saudi Arabia), a curated, publicly accessible database that integrates 5,132 Saudi clinical cases from four Saudi cohorts and 522 cases from analysis of a mixed-population cohort, together with 1,856 cases from the Deciphering Developmental Disorders study (DDD) and 9,588 literature phenopackets. Each case record describes patient-level phenotypes, encoded with the Human Phenotype Ontology (HPO), and links them to genomic variants, gene identifiers, zygosity, pathogenicity classifications, and disease diagnoses mapped to standardized disease terminologies. The data is represented in Phenopackets format and as a knowledge graph in RDF. Additionally, a web interface provides phenotype-based similarity search, gene and variant browsers, and an HPO hierarchy explorer. We evaluate the utility of the phenotype annotations for gene prioritization using semantic similarity. While there are clear differences from global literature-curated databases, phenotypes in PAVS can rank the correct gene highly (ROC AUC: 0.89). PAVS addresses a gap in population-specific genotype-phenotype resources and provides a benchmark for phenotype-driven variant prioritization in under-represented populations.

6

Assessing the clinical significance of a novel rare variant in Loeys-Dietz Syndrome by combining AI-driven modelling and cell biology

Boukrout, N.; Delage, C.; Comptdaer, T.; Arondal, W.; Jemel, A.; Azabou, N.; Bousnina, M.; Mallouki, M.; Sabaouni, N.; Arbi, R.; Kchaou, S.; Ammar, H.; Hantous-Zannad, S.; Jilani, H.; Elaribi, Y.; Benjemaa, L.; Van der Hauwaert, C.; Larrue, R.; Cheok, M.; Perrais, M.; Lefebvre, B.; Cauffiez, C.; Pottier, N.

2026-03-31 genetic and genomic medicine 10.64898/2026.03.30.26349510 medRxiv

Top 0.1%

9.9%

Loeys-Dietz syndrome (LDS) is an autosomal dominant connective-tissue disorder caused by genetic variants in TGF-β pathway genes, most often TGFBR1/2. While pathogenic TGFBR2 mutations usually cluster in the kinase domain and disrupt SMAD signalling, confidently distinguishing variants that impair TGFBR2 function from rare benign genetic alterations remains one of the most important ongoing challenges for accurate genetic testing. Therefore, there is a pressing need to develop methods that can improve functional variant interpretation. Here, we describe and characterize the functional impact of a novel genetic variant in the TGFBR2 kinase domain (E431K) in a patient with the clinical diagnosis of syndromic genetic aortopathy. We assessed the structural and functional consequences of this variant using AI-driven molecular modelling and in vitro cell-based assays. A high-quality homology-based model of TGFBR2 was generated and computational mutagenesis based on the structural context and evolutionary conservation was used to forecast variant pathogenicity. Relative to wild type, the variant affects protein stability by disrupting intramolecular interactions and likely induces conformational changes that may affect kinase activity and thus TGF-β signalling. This was experimentally confirmed by showing abnormal protein level and alteration of canonical TGF-β pathway activation. Overall, our results establish that the E431K variant leads to aberrant TGF-β signalling and confirm the diagnosis of Loeys-Dietz syndrome type 2 in this patient.

7

The Russian FSHD registry: a first look at the cohort

Kuchina, A.; Sherstyukova, D.; Borovikov, A.; Soloshenko, M.; Zernov, N.; Subbotin, D.; Dadali, E.; Sharkova, I.; Rudenskaya, G.; Kutsev, S.; Skoblov, M.; Murtazina, A.

2026-04-01 genetic and genomic medicine 10.64898/2026.03.31.26349837 medRxiv

Top 0.1%

8.4%

Background: Facioscapulohumeral muscular dystrophy (FSHD) is a common hereditary neuromuscular disorder. The Russian FSHD Patient Registry was established in 2019 following the development of a PCR-based method for genetic confirmation of the diagnosis. Results: The registry included 470 participants (51% male). Genetic confirmation was obtained for 76% (n=356); the remainder were included based on clinical and anamnestic data. Clinical assessment forms and patient-reported questionnaires were analyzed for 310 and 142 patients, respectively. D4Z4 repeat unit (RU) distribution showed patterns consistent with European cohorts, with a predominance of patients with 3 RUs. A moderate inverse correlation was found between RU number and clinical severity scales. Periscapular weakness was the most common onset manifestation (46.8%), followed by facial weakness (31.6%), which was often unnoticed by patients. The mean age in the Russian cohort was 37.8 years (range 0-97), indicating a younger cohort compared to international data. A delta-adjusted cluster analysis (n=215) identified three distinct trajectories: a classic phenotype with onset before age 14 and early involvement of various muscle groups (n=177), and two clusters characterized by either facial or periscapular onset with slow progression. Conclusion: The Russian FSHD registry provides a comprehensive characterization of a large national cohort, revealing a predominance of patients with 3 D4Z4 repeats and a younger demographic profile compared to international data. Cluster analysis identified three heterogeneous disease trajectories, offering a framework for improved patient stratification.

8

UshEffect-3D: Structure-informed Classification of USH2A Missense Variants for Inherited Retinal Disease

Choudhary, D.; Portelli, S.; Ascher, D. B.

2026-04-27 bioinformatics 10.64898/2026.04.23.720479 medRxiv

Top 0.1%

8.4%

Purpose: Variants of uncertain significance (VUS) in USH2A represent a critical interpretive challenge in inherited retinal disease, with over 70% of ClinVar submissions for this gene currently unresolved. We aimed to develop a gene-specific, structure-informed machine learning framework to improve the clinical classification of USH2A missense variants and provide a tractable tool to aid the diagnosis of Usher Syndrome II. Methods: A dataset of 545 curated USH2A missense variants with established clinical classifications was assembled from ClinVar and LOVD. AlphaFold2-predicted domain structures were used to generate local structural descriptors and biochemical features combined with sequence-based evolutionary conservation scores, yielding 153 candidate features reduced to nine via sequential feature selection. Eleven machine learning classifiers were trained using a 10-fold cross-validation strategy, then independently assessed on a blind test set and validated against 78 ACMG-classified pathogenic variants. Model predictions were benchmarked against five general-purpose variant effect predictors and applied to 2639 USH2A VUS from ClinVar. Feature contributions were analysed using SHAP analysis and ablation studies. Results: The Random Forest classifier achieved the highest performance on the blind test set, with an MCC of 0.87 and AUC of 0.97. On independent ACMG validation, sensitivity reached 0.73 with perfect precision. UshEffect-3D substantially outperformed all general-purpose predictors, including PolyPhen-2 (MCC = 0.61), AlphaMissense (MCC = 0.42), and ESM-1b (MCC = 0.32). SHAP analysis identified evolutionary conservation as a dominant predictor, with structural stability providing an independent but complementary signal. Applied to 2639 ClinVar VUS, the model prioritised 888 variants (33.6%) as likely pathogenic, particularly enriched within the Laminin N-terminal and Laminin G-like domains.
Conclusions: UshEffect-3D demonstrates that gene-specific, structure-informed machine learning substantially outperforms general-purpose variant effect predictors for USH2A missense variant interpretation. This framework provides a high-confidence prioritization resource for the large unresolved VUS burden in this gene to facilitate earlier molecular resolution of USH2A-associated disease. As gene-directed therapies for USH2A-associated retinal disease advance toward clinical application, accurate and interpretable variant classification will be essential for equitable patient selection. UshEffect-3D is freely accessible via an interactive web server.
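
The Matthews correlation coefficient (MCC) used to compare predictors above can be sketched as follows. This is a generic illustration of the standard formula, not the authors' evaluation code; the function name and the confusion-matrix counts are hypothetical.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts, for illustration only (not data from the study).
print(f"MCC = {mcc(tp=40, tn=45, fp=5, fn=10):.2f}")
```

MCC uses all four cells of the confusion matrix, so it stays informative when pathogenic and benign classes are imbalanced, which is common in curated variant sets.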

9

Ancestry-stratified variant classification in monogenic diabetes genes: annotation coverage and differential curation burden

Dario, P.

2026-04-07 genetic and genomic medicine 10.64898/2026.04.06.26350230 medRxiv

Top 0.1%

7.2%

The variant databases ClinVar and gnomAD are the backbone of clinical variant interpretation, but their population composition is skewed toward European ancestry. Whether this skew creates systematic classification disadvantages for non-European patients with monogenic diabetes has not been examined at the database level. ClinVar variant_summary (GRCh38, April 2026; 4,421,188 variants) was cross-referenced with gnomAD v4.0 genome data for 17 monogenic diabetes genes. Annotation coverage and variant classification rates were computed stratified by genetic ancestry group (AFR, AMR, EAS, SAS, MID, NFE, FIN, ASJ). Of 14,691 gnomAD variants across the 17 genes, only 29.7% had any ClinVar classification (range: 12.7%-61.3% by gene). Among classified variants, non-Finnish European (NFE) variants had the highest variant of uncertain significance (VUS) rate (32.1%) and the lowest benign/likely benign fraction (41.6%), consistent with a large submission volume without functional follow-up. African-ancestry (AFR) variants showed the second-highest VUS rate (29.2%), not statistically distinguishable from NFE after Bonferroni correction, while all other non-European groups had significantly lower rates (all p < 0.001). GCK showed a pattern inversion, with the non-European VUS rate (18.5%) exceeding the European rate (15.0%), consistent with progressive reclassification in European populations that is absent in non-European cohorts. Annotation coverage and VUS divergence were uncorrelated (r = -0.15, p = 0.57). The primary equity problem is a 70% annotation gap combined with a non-European curation deficit, not a simple VUS excess. Ancestry-stratified evaluation of ClinGen Variant Curation Expert Panel (VCEP) criteria performance is warranted across disease domains.
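
As a reminder of what the Bonferroni correction mentioned above does, here is a minimal sketch. The abstract does not state how many comparisons were made or give the raw p-values, so the inputs below are hypothetical, as is the function name.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-value, significant?) for each test.

    Each raw p-value is multiplied by the number of tests and capped at 1;
    a test is significant if its adjusted p-value is below alpha.
    """
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    return [(p_adj, p_adj < alpha) for p_adj in adjusted]


# Hypothetical raw p-values for pairwise comparisons against one reference group.
for p_adj, significant in bonferroni([0.0001, 0.03, 0.4]):
    print(f"adjusted p = {p_adj:.4f}, significant = {significant}")
```

The correction is conservative: a comparison with a modest raw p-value can lose significance once the number of tests is taken into account.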

10

Cortical Organoid Model of PPP2R5D Genetic Intellectual Disability Models Disease Severity Phenotype

Du, Y.; Singh, M.; Patil, M.; Villeagas, I.; Portillo, A.; Shang, K.; Ben-Shalom, R.; Halmai, J.; Fink, K.

2026-05-27 cell biology 10.64898/2026.05.26.728012 medRxiv

Top 0.1%

7.0%

Jordan's Syndrome (JS) is a rare neurodevelopmental disorder caused by de novo missense mutations in protein phosphatase 2 regulatory subunit B'delta (PPP2R5D). JS is characterized by severe neurological impairments starting in early life. PPP2R5D encodes B56δ, one of the regulatory subunits of protein phosphatase 2A (PP2A). PP2A is a heterotrimeric protein serine/threonine phosphatase that is highly expressed in the brain and the liver. Past studies have focused on PP2A's role in liver and little is known about the holoenzyme's behavior in neuronal cells. Although B56δ is known to play an important role in the substrate specificity of PP2A, validated downstream substrates in JS have yet to be identified. To better understand how the mutations affect neuronal cells, we developed cerebral cortical-like organoids from an engineered allele series of the most common JS mutations to characterize the physiological changes throughout different stages of neurodevelopment. Organoids were assessed for transcriptomic, protein, and electrophysiological changes utilizing bulk RNA sequencing, immunocytochemistry, Western Blot, and high-density MicroElectrode Array. The results identify differentially expressed genes and translated proteins, potential neuronal substrates, and significant electrophysiological signatures that suggest mutations in B56δ lead to variant-specific dysfunction of PP2A. Overexpression of PPP2R5D through AAV transduction of organoids rescued several phenotypes in the variants, suggesting that the underlying pathogenic etiology differs among variants. Our findings successfully characterized cerebral cortical-like organoids in JS cell lines and demonstrated their potential as a model for studying neurodevelopmental disorders and for screening therapeutic approaches.

11

Genotype-Based Severity Scoring System in Wolfram Syndrome

Oiknine, L.; Tang, A. F.; Urano, F.

2026-03-26 genetic and genomic medicine 10.64898/2026.03.24.26349216 medRxiv

Top 0.1%

6.9%

Wolfram syndrome is a rare genetic disorder characterized by antibody-negative early-onset atypical diabetes mellitus, optic nerve atrophy, sensorineural hearing loss, diabetes insipidus (arginine vasopressin deficiency), and progressive neurodegeneration, with significant variability in disease severity. We assessed the accuracy of a genotype-based severity scoring system to predict the onset of cardinal symptoms in Wolfram syndrome. This system is based on the type of WFS1 variants (in-frame or out-of-frame) and their location relative to transmembrane domains. Severity scores were assigned to 324 patients with documented onset ages for diabetes mellitus, optic atrophy, hearing loss, and diabetes insipidus. Our analysis revealed a clear correlation between severity scores and earlier onset of diabetes mellitus and optic atrophy. Patients with in-frame variants outside transmembrane domains exhibited milder symptoms, especially those with the WFS1 c.1672C>T (p.Arg558Cys) variant, whereas those with out-of-frame variants showed the earliest onset. Severity scores 3 and 4 did not follow the expected progression, suggesting that transmembrane domain involvement in both alleles may result in greater severity. These findings suggest that this scoring system provides valuable insights into the progression of Wolfram syndrome and may guide clinical care. Further refinement may improve its utility for predicting the onset of non-diabetic symptoms.

12

Building an Interoperable Rare Disease Multi-omic Resource: The GREGoR Data Model and Dataset

Heavner, B. D.; Wheeler, M. M.; Bengtsson, J. D.; Carvalho, C. M. B.; Cheung, W. A.; Conomos, M. P.; Delot, E. C.; DiTroia, S.; Ganesh, V. S.; Gogarten, S. M.; Grochowski, C. M.; Jhangiani, S. N.; King, C. H.; LeMaster, C.; Marvin, C. T.; Marwaha, S.; Miller, D. E.; O'Donnell-Luria, A.; Pais, L.; Patterson, K.; Qi, G.; Richardson, M.; Smail, C.; Stilp, A. M.; Tong, C. C.; Ungar, R. A.; Weisburd, B.; Bamshad, M. J.; Bernstein, J. A.; Eichler, E. E.; Gibbs, R. A.; Lupski, J. R.; May, S. J.; Montgomery, S. B.; Pastinen, T.; Posey, J.; Rehm, H. L.; Shojaie, A.; Talkowski, M. E.; Vilain, E.; Wei, C

2026-05-19 genomics 10.64898/2026.05.15.725546 medRxiv

Top 0.1%

6.7%

Rare disease research and diagnosis rely on the integration of genomic and phenotypic data generated across diverse clinical sites; however, the absence of widely adopted standards for representing genomic data and associated metadata has limited data interoperability, reuse, and cross-study analysis. The Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Consortium was established to investigate challenging rare disease cases and evaluate emerging multi-omic technologies for clinical translation. To support coordinated data integration across distributed research sites, we developed a common Consortium Data Model in partnership with domain experts to standardize the capture of participant-, family-, phenotype- and assay-level metadata. The model emphasizes a modular architecture that supports linking multiple data versions from multiple omic technologies to a single individual and attributing a genetic finding to the specific technology used for its initial discovery. Adoption of the GREGoR Data Model has enabled continued generation and public release of a harmonized, analysis-ready Consortium Dataset. The most recent release includes phenotypic, family and multi-omic data from 12,292 participants in 5,029 families. Other rare disease data sharing efforts are beginning to adopt this data model, which will facilitate cross-consortium analyses and empower rare disease research. This work demonstrates that a collaborative, flexible, and scalable data model can enable large-scale rare disease research, facilitate cross-center data harmonization, and enable data interoperability.

13

Biallelic CYB5A disruptions in 46,XY Disorder of Sex Development: Identification and Characterization of a Novel Deep Intronic Variant

Moradifard, S.; Le, T. N. U.; Ha, N. T.; Dung, V. C.; Thao, B. P.; Harley, V. R.

2026-05-12 genetic and genomic medicine 10.64898/2026.05.05.26352416 medRxiv

Top 0.1%

6.4%

Background: The diagnostic yield for 46,XY disorders of sex development (DSD) remains limited. Whole-genome sequencing (WGS) improves detection of both coding and non-coding variants that may be missed by routine testing. Cytochrome b5, encoded by CYB5A, is an essential co-factor for CYP17A1-mediated 17,20-lyase activity. We report WGS of a Vietnamese family with 46,XY DSD in which two siblings presented with female external genitalia. Methods: Clinical assessment and hormone profiling were conducted. WGS was conducted on peripheral blood DNA from the two affected siblings, followed by variant annotation and ACMG-based classification. A minigene RNA splicing assay in HEK293 cells was used to evaluate the functional impact of the CYB5A intronic variant. Results: The patients' hormone profile showed low testosterone and estradiol. WGS identified compound-heterozygous CYB5A variants: a paternally inherited missense variant (p.Val34Glu, likely pathogenic) and a maternally inherited deep intronic deletion (c.129+862_129+863del) for which SpliceAI predicted aberrant splicing. Minigene assays confirmed that the intronic deletion creates cryptic splice sites, resulting in pseudoexon inclusion and a premature stop codon, consistent with nonsense-mediated decay. The intronic variant meets ACMG criteria for pathogenicity. Conclusion: This family expands the spectrum of CYB5A-related DSD and demonstrates that compound-heterozygous variants, including deep intronic defects, can lead to a disruption in 17,20-lyase activity. These findings highlight the importance of WGS and functional assays for identifying clinically relevant non-coding variants in DSD.

14

Identifying disease-causing mechanisms and fundamental biology of neuromuscular disorder genes through genomic feature analysis

Martin, A.; Llanes-Cuesta, M. A.; Hartley, J. N.; Frosk, P.; Drogemoller, B. I.; Wright, G. E. B.

2026-04-22 genetics 10.64898/2026.04.21.719902 medRxiv

Top 0.1%

6.3%

Introduction: Neuromuscular disorders (NMDs) encompass a broad group of conditions that primarily affect the peripheral nervous system. They are often caused by genetic alterations that impair skeletal muscle function and result in debilitating symptoms. Obtaining an accurate molecular diagnosis remains a challenge, potentially because some variants lie in genes that have yet to be identified as causal. We therefore used advanced computational methods to study the genetic architecture of NMDs and to identify key features that distinguish NMD genes from other genes in the broader genome. Methods: Curated genes implicated in NMDs (n = 639; GeneTable of NMDs) were obtained and merged with a comprehensive set of genomic features for human autosomal protein-coding genes. Machine-learning-based feature selection and ranking were performed using Boruta, along with complementary analytical approaches. These analyses were used to identify the most important genic features (n = 134; subcategories: gene complexity, genetic variation, expression patterns, and other general gene traits) for discriminating NMD genes from other genes in the genome. Results: NMD genes exhibit enriched expression in disease-relevant tissues, including skeletal muscle and heart. Additionally, compared with other protein-coding genes, these genes exhibit increased transcriptomic complexity (e.g., longer transcripts and more unique isoforms), contain more short tandem repeats, and show greater variation in conservation across model organisms. Conclusions: This study identified several key genomic features that may distinguish NMD genes from the rest of the genome. These features may enhance the identification of novel causal genes and could ultimately facilitate earlier diagnosis and medical management for affected individuals.

15

Patient iPSC-Derived Cartilage Organoids Reveal Defective ECM Deposition and Altered Chondrogenic Trajectory in Saul-Wilson Syndrome

Mahajan, S.; Ancel, S.; Ascone, G.; Kaur, R.; Torres, J.; Murad, R.; Wang, Y. X.; Ferreira, C. R.; Freeze, H.

2026-04-14 developmental biology 10.64898/2026.04.10.717608 medRxiv

Top 0.1%

6.3%

Saul-Wilson syndrome (SWS) is a skeletal dysplasia characterized by primordial dwarfism and progeroid features caused by a recurrent dominant COG4 variant (p.G516R). We previously showed that this mutation accelerates Golgi retrograde trafficking and disrupts glycosylation of the proteoglycan decorin, while zebrafish models revealed defects in chondrocyte elongation and intercalation. We have also shown that SW1353 chondrosarcoma cells carrying the SWS variant exhibit reduced secretion of extracellular matrix (ECM) components. While these results indicate a critical function of COG4 in Golgi processing, the developmental process leading to skeletal dysplasia in SWS patients remains unknown. Here, we generated patient-derived iPSC cartilage organoids (SWS organoids), modeling early human chondrogenesis. SWS organoids failed to produce cartilage structures and displayed poor expression of chondrogenic markers. Time-course RNA-seq analysis of the chondrogenic process revealed reduced activation of gene networks involved in skeletal development, ECM organization, ossification, and glycosaminoglycan metabolism. Spatial multiomic analysis of protein and glycosylation by CODEX and GLYPH imaging revealed an altered chondrogenic trajectory, persistence of mesenchymal states, global glycosylation changes, and reduced deposition of chondroitin sulfate proteoglycans. These results indicate that the COG4 mutation disrupts ECM glycosylation and chondrogenic commitment, and that SWS organoids model the early defects in cartilage formation that underlie impaired skeletal growth in SWS.
Highlights
- Patient iPSC-derived cartilage organoids model development defects in Saul-Wilson syndrome
- SWS organoids show defective extracellular matrix deposition and attenuated chondrogenic gene expression
- Glycan profiling reveals global glycosylation defects and deficient proteoglycan GAG chains
- An early developmental impairment in chondrogenesis alters skeletal formation in Saul-Wilson syndrome

16

Phenotype-Specific Recalibration of MAVE Data Enables Repurposing of BAP1 Functional Assays for Kury-Isidor Syndrome

Gupta, P.; Balton, E. V.; Tejura, M.; Kumar, R. D.; Snyder, M. W.; Stone, J.; Villani, R. M.; Peter, B. H.; Sirisak, C.; Ian, G. A.; Martha, H.-P.; Danny, M. E.; Jane, R.; Elisabeth, R. A.; Andrew, S. H.; Mark, W.; Undiagnosed Diseases Network (UDN); Kathleen, L. A.; Matthew, B. D.; Melissa, M. J.; Gail, J. P.; Katrina, D. M.; Elizabeth, B. E.; Fowler, D. M.; Starita, L. M.; McEwen, A. E.; Stergachis, A. B.

2026-05-21 genetic and genomic medicine 10.64898/2026.05.15.26352805 medRxiv

Top 0.1%

5.3%

Purpose: Multiplexed assays of variant effect (MAVEs) are transforming clinical variant interpretation. However, many genes are associated with more than one disease, making it unclear whether functional data generated in one disease context are directly applicable to another. For example, germline BAP1 missense variants are associated with both BAP1 tumor predisposition syndrome (BAP1-TPDS) and Kury-Isidor syndrome (KURIS), a rare neurodevelopmental disorder. Here, we demonstrate how phenotype-specific calibration of BAP1 MAVE data enables disease-specific variant classification. Methods: Saturation genome editing (SGE) data for BAP1 were recalibrated using either BAP1-TPDS- or KURIS-associated missense variants as pathogenic controls. Functional evidence strength was quantified using the Odds of Pathogenicity (OddsPath) framework and mapped to ACMG/AMP PS3/BS3 criteria. Recalibrated functional evidence was integrated with standard clinical criteria for variant classification. A workshop was developed to teach phenotype-specific MAVE recalibration to clinicians and variant curators and was evaluated for educational impact. Results: Phenotype-specific recalibration using BAP1-TPDS and KURIS controls yielded OddsPath values consistent with PS3_Strong evidence in both contexts. Application of KURIS-specific recalibration enabled the diagnosis of KURIS in an individual with a BAP1 missense variant previously classified as uncertain. The educational workshop produced a quantifiable improvement in participants' understanding of how to apply functional evidence. Conclusion: Phenotype-specific recalibration enables appropriately calibrated reuse of MAVE datasets across distinct disease contexts, increasing the clinical utility of MAVE datasets and the interpretability of variants in pleiotropic genes. This framework expands the diagnostic utility of existing functional datasets without requiring new experimental assays.
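
The OddsPath calculation mentioned in the Methods can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the control counts in the example, and the strength cut-offs (the commonly cited ClinGen-style values of 2.1, 4.3, 18.7 and 350) are assumptions introduced here for illustration. The idea is that the proportion of pathogenic controls overall (P1) is compared with the proportion of pathogenic controls among variants that read out as functionally abnormal (P2); swapping in a different set of pathogenic controls, as in phenotype-specific recalibration, changes both proportions.

```python
# Illustrative sketch only; names, counts and thresholds are assumptions.

def odds_path(n_path, n_benign, n_path_abnormal, n_benign_abnormal):
    """Odds of pathogenicity for a functionally abnormal assay readout.

    n_path / n_benign: pathogenic and benign control variants assayed.
    n_path_abnormal / n_benign_abnormal: how many of each read out abnormal.
    """
    if n_path <= 0 or n_benign <= 0:
        raise ValueError("need at least one pathogenic and one benign control")
    n_abnormal = n_path_abnormal + n_benign_abnormal
    if n_abnormal == 0:
        raise ValueError("no control variant had an abnormal readout")
    p1 = n_path / (n_path + n_benign)   # prior proportion pathogenic
    p2 = n_path_abnormal / n_abnormal   # proportion pathogenic given abnormal
    if not 0 < p2 < 1:
        # A perfectly separating assay gives p2 == 1 and an undefined ratio;
        # how to handle that case is left to the caller in this sketch.
        raise ValueError("P2 must lie strictly between 0 and 1")
    return (p2 * (1 - p1)) / ((1 - p2) * p1)


# Assumed cut-offs, checked from strongest to weakest.
PS3_THRESHOLDS = [
    (350.0, "PS3_VeryStrong"),
    (18.7, "PS3_Strong"),
    (4.3, "PS3_Moderate"),
    (2.1, "PS3_Supporting"),
]


def ps3_strength(value):
    """Map an OddsPath value to an evidence label using the assumed cut-offs."""
    for threshold, label in PS3_THRESHOLDS:
        if value > threshold:
            return label
    return "no PS3 evidence"


# Hypothetical controls: 10 pathogenic and 10 benign, of which 9 and 1 are abnormal.
example = odds_path(10, 10, 9, 1)   # about 9.0
example_label = ps3_strength(example)
```

With these made-up counts the ratio is about 9, which falls in the moderate band under the assumed cut-offs; reaching a stronger label requires more controls or cleaner separation.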

17

Toward Early Diagnosis and Therapeutic Discovery in CLN3 Disease: A Computational Biomarker Discovery Framework

Sun, S.; Dang Do, A. N.; Thurm, A.; Soldatos, A.; Zhu, Q.

2026-05-07 genetic and genomic medicine 10.64898/2026.05.01.26352147 medRxiv

Top 0.1%

5.0%

Background: CLN3 disease, also known as juvenile neuronal ceroid lipofuscinosis, is a rare neurodegenerative disorder characterized by the accumulation of lipopigments in cells, progressive cognitive decline, seizures, and vision loss. Biomarker discovery in CLN3 disease is essential for enabling early and accurate diagnosis, which is critical given its neurodegenerative course. Biomarkers provide objective measures to track disease progression, stratify patients, and serve as surrogate endpoints in clinical trials, thereby accelerating therapeutic development. They also offer valuable insights into underlying disease mechanisms and treatment response, ultimately advancing individualized medicine and improving clinical outcomes. Methods: We developed several machine learning models to predict potential protein biomarkers in CLN3 disease using proteomics data and laboratory tests collected from participants in a prospective, observational cohort. To prioritize and evaluate these candidates, we conducted protein-protein interaction (PPI) network analysis and pathway enrichment, ranking proteins based on their topological importance. The top 20 proteins were selected as candidate biomarkers and corroborated using a publicly available CLN3 transcriptomic dataset. Receiver operating characteristic (ROC) curve analysis was performed to assess the discriminative power of each candidate, with AUROC values calculated to quantify their classification performance. Results: Our computational approach identified six promising biomarker candidates: OSM, IL6R, LMNB1, HIF1A, NPM1, and CSF1. Among them, OSM and HIF1A showed marked differential expression in CLN3 patients, particularly those with slow disease progression. LMNB1 expression was elevated in patients with faster disease progression, suggesting its utility as a prognostic biomarker. These findings support the robustness of our biomarker selection and indicate that these six genes may serve as effective diagnostic markers for CLN3 disease. Conclusions: Our findings demonstrate the utility of data-driven approaches for biomarker discovery in CLN3 disease and offer new insights into the molecular mechanisms of the disease, with broader implications for improving diagnosis and prognosis in other rare diseases.
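
As a rough illustration of the ROC step described in the Methods, the AUROC of a single candidate marker can be computed directly from its definition: the probability that a randomly chosen case scores higher than a randomly chosen control, counting ties as one half. The sketch below is an assumption-laden toy, not the authors' code; the function name and the expression values are hypothetical.

```python
# Illustrative sketch only; the function name and all values are hypothetical.

def auroc(case_scores, control_scores):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    if not case_scores or not control_scores:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0      # case ranked above control
            elif case == control:
                wins += 0.5      # tie counts as half
    return wins / (len(case_scores) * len(control_scores))


# Made-up expression levels for one candidate protein.
cases = [0.9, 0.8, 0.4]
controls = [0.5, 0.3, 0.2]
score = auroc(cases, controls)   # 8 of 9 case-control pairs are ordered correctly
```

The pairwise form is quadratic in sample size, which is acceptable for small cohorts; a rank-based formulation gives the same value more efficiently on larger datasets.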

18

MAP3K7 novel variants in syndromic 46,XY DSD

Le, T. N. U.; Moradifard, S. M.; Reyes, A. P.; Ngoc Can, T. B.; Gomes, A. T.; Jones, M. C.; Vu Chi, D.; Harley, V.

2026-05-06 genetic and genomic medicine 10.64898/2026.05.05.26352427 medRxiv

Top 0.1%

4.4%

Mutations in MAP3K7 are responsible for two distinct syndromes: cardiospondylocarpofacial syndrome (CSCF) and frontometaphyseal dysplasia 2 (FMD2). Both are characterized by skeletal malformations, facial dysmorphisms, hearing loss, and mild intellectual disability. While cardiac defects are predominant in CSCF, keloid scarring is a distinctive feature of FMD2. Problems with gonadal development and disorders of sex development (DSD) have not been previously characterized. Here we report three syndromic cases of 46,XY DSD with CSCF or FMD2, each carrying a novel heterozygous missense variant in MAP3K7 (NM_145331.3:c.250G>A; p.V84M, NM_145331.3:c.195A>G; p.I65M, and NM_145331.3:c.574A>G; p.S192G). The DSD phenotypes include cryptorchidism, micropenis, small testes, and hypospadias. In silico tools predict that all three variants are deleterious. All three MAP3K7 variants occur in the kinase domain at positions that are highly conserved among mammals. MAP3K7 is highly expressed in human fetal Sertoli cells. MAP3K7 knock-out in HEK293T cells led to downregulation of GATA4 and FOG2 expression, as measured by RNA-Seq. Like MAP3K1, MAP3K7 phosphorylated p38, whereas none of the three MAP3K7 variants altered p38 phosphorylation compared with wild type in HEK293T MAP3K7-/- cells. Two MAP3K7 missense mutants (p.V84M and p.I65M) ectopically activated ovarian beta-catenin/Wnt signalling in TOPFLASH assays. Our data suggest that MAP3K7 contributes to male sex differentiation by increasing expression of the pro-testis genes GATA4 and FOG2 in HEK293T MAP3K7-/- cells and by antagonizing pro-ovarian beta-catenin signalling, and that one or more of these activities were likely affected during sex development in the three cases of 46,XY DSD with CSCF/FMD2.

19

Genome-wide detection and clinical prioritization of tandem repeat outliers using long-read sequencing

Gibson, S. B.; Damaraju, N.; Gustafson, J. G.; Balton, E. V.; Chanprasert, S.; Glass, I. A.; Horike-Pyne, M.; Kumar, R. D.; Leppig, K. A.; Lundberg, C.; Ranchalis, J.; Rosenthal, E. A.; Solomon, A. K.; Stergachis, A. B.; Wener, M.; UDN; Jarvik, G. P.; Blue, E. E.; Dipple, K. M.; Dashnow, H.; Starita, L. M.; Miller, D. E.

2026-05-01 genetic and genomic medicine 10.64898/2026.04.30.26352103 medRxiv

Top 0.1%

4.3%

Background: Tandem repeat expansions (TREs) cause over 60 known neurological, neuromuscular, and developmental disorders. Detecting these expansions genome-wide is challenging due to their size, sequence complexity (including interruptions), and population variation. While long-read sequencing is an emerging technology that can fully resolve many TREs, no methods have been described for genome-wide identification and prioritization of candidate pathogenic TREs with this technology. Methods: Using a newly developed pipeline called TRoLR (Tandem Repeat outliers identified with Long Reads), we analyzed haplotype-resolved long-read genome assemblies from 471 ancestrally diverse individuals to define population distributions for over three million tandem repeat loci, capturing clinically relevant interruptions. Outlier expansions were identified relative to these distributions and prioritized by genomic location and comparison to known pathogenic loci. The framework was applied to 47 cases from the Undiagnosed Diseases Network. Results: Population stratification of repeat metrics was observed at 7% of loci, with the highest variability among individuals of African ancestry. Outlier analysis confirmed known pathogenic CNBP and ATXN8OS expansions, detected carrier-range alleles at RFC1, CSTB, and FXN, and revealed a novel CGG expansion in the 5' UTR of PCMTD2 exhibiting hypermethylation and intergenerational instability. Genome-wide screening also identified intronic pentanucleotide expansions at IQCB1 and MAP3K15 in controls; these expansions are composed of motifs that have been associated with pathogenicity at other disease loci. Conclusions: Quantifying the longest uninterrupted repeat segment in long-read assemblies enables detection of clinically relevant repeat expansions and loss of stabilizing interruptions. This approach enhances both diagnostic confirmation and discovery of candidate pathogenic expansions, with implications for clinical interpretation and research into complex repeat-mediated disorders.
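
The two ideas in this abstract, measuring the longest uninterrupted repeat segment and flagging alleles that are outliers relative to a population distribution, can be sketched in a few lines. This is not TRoLR: the function names, the median/MAD-based score, the 3.5 cut-off, and the example sequences are all assumptions chosen for illustration.

```python
# Illustrative sketch only; not the TRoLR implementation. Names, the robust
# z-score and the 3.5 cut-off are assumptions.
from statistics import median


def longest_uninterrupted_run(seq, motif):
    """Largest number of back-to-back copies of `motif` anywhere in `seq`."""
    if not motif:
        raise ValueError("motif must be non-empty")
    k = len(motif)
    best = 0
    for start in range(len(seq)):
        copies = 0
        # Count consecutive motif copies beginning at this start position.
        while seq.startswith(motif, start + copies * k):
            copies += 1
        best = max(best, copies)
    return best


def is_expansion_outlier(value, population, cutoff=3.5):
    """One-sided robust outlier test using the median and MAD of `population`."""
    center = median(population)
    mad = median(abs(x - center) for x in population)
    if mad == 0:
        # Degenerate distribution: any value above the median stands out.
        return value > center
    robust_z = 0.6745 * (value - center) / mad
    return robust_z > cutoff


# A CAA interruption splits this allele into pure runs of 2 and 4 CAG copies.
allele = "CAGCAG" + "CAA" + "CAGCAGCAGCAG"
run = longest_uninterrupted_run(allele, "CAG")

# Hypothetical population of longest-run lengths at one locus.
population_runs = [10, 11, 12, 11, 10, 12, 11]
flag_expanded = is_expansion_outlier(40, population_runs)
flag_typical = is_expansion_outlier(12, population_runs)
```

Scoring the longest pure run rather than total repeat length means that losing an interruption raises the score even when the overall allele length is unchanged, which is the property the Conclusions point to.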

20

Duchenne muscular dystrophy is driven by defective membrane repair and annexin-A2 dysregulation in skeletal muscle

Le Quang, M.; d'Agata, L.; Carmeille, R.; Rassinoux, P.; Ruiz, J.; Gounou, C.; Salesses, A.; Bouvet, F.; Mamchaoui, K.; Dovero, S.; Deburgrave, N.; Leturcq, F.; Sole, G.; Martin-Negrier, M.-L.; Bouter, A.

2026-04-23 cell biology 10.1101/2025.09.23.677988 medRxiv

Top 0.1%

4.2%

Background: Duchenne muscular dystrophy (DMD) is caused by mutations in the DMD gene, which encodes dystrophin in skeletal muscle cells. Although the role of dystrophin as a structural protein is well known, the cellular processes underlying myofiber degeneration are still not fully understood. Despite advances from studies in murine models, these models do not fully replicate the human pathology. Methods: We investigated sarcolemmal integrity, membrane repair capacity, and annexin protein expression in DMD patient muscle biopsies and human skeletal muscle cell lines using immunohistochemistry, both shear stress-based and laser irradiation injury assays, western blotting, and live-cell imaging of GFP-tagged annexins. Results: We identified defective membrane repair in DMD skeletal muscle cells, independent of increased membrane fragility, by evaluating resealing capacity in control and DMD patient-derived cell lines using both a shear stress assay (N = 12, p < 0.0001) and a laser irradiation assay (N = 3, p < 0.0001). Analyses performed on human DMD muscle biopsies (N = 10) further confirmed this defect, demonstrating massive intracellular IgG uptake (p < 0.0001) together with altered annexin expression profiles. While mechanical stress induces the upregulation of annexin A5 (ANXA5, p < 0.01) and A6 (ANXA6, p < 0.05) in healthy skeletal muscle cells - suggesting an adaptive response to membrane damage, given the annexin family's central role in membrane repair - we observed dysregulated expression patterns of these proteins in DMD cells. Notably, ANXA1 (p < 0.05) and ANXA2 (p < 0.01) were not only significantly overexpressed but also aberrantly localized to the extracellular space, a putative consequence of defective membrane repair. Since extracellular ANXA2 has been associated with adipocyte accumulation in the muscle tissue of patients with dysferlinopathy, a similar pathological mechanism may be at play in DMD. Conclusions: Our findings suggest that ANXA2 contributes to muscle degeneration in DMD and highlight it as a potential therapeutic target to prevent adipogenesis and muscle loss.